1,556 research outputs found

How to Get the Most out of Your Curation Effort

Authors: Rzhetsky Andrey; Shatkay Hagit; Wilbur W. John
Publication date: 21/12/2023

Large-scale annotation efforts typically involve several experts who may disagree with each other. We propose an approach for modeling disagreements among experts that allows providing each annotation with a confidence value (i.e., the posterior probability that it is correct). Our approach allows computing a certainty level for individual annotations, given annotator-specific parameters estimated from data. We developed two probabilistic models for performing this analysis, compared these models using computer simulation, and tested each model's actual performance, based on a large data set generated by human annotators specifically for this study. We show that even in the worst-case scenario, when all annotators disagree, our approach allows us to significantly increase the probability of choosing the correct annotation. Along with this publication we make publicly available a corpus of 10,000 sentences annotated according to several cardinal dimensions that we have introduced in earlier work. The 10,000 sentences were all 3-fold annotated by a group of eight experts, while a 1,000-sentence subset was further 5-fold annotated by five new experts. While the presented data represent a specialized curation task, our modeling approach is general; most data annotation studies could benefit from our methodology.
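
The idea of attaching a posterior probability to each annotation can be illustrated with a minimal sketch. This is not either of the two models from the paper. It assumes a much simpler setup: each annotator is independently correct with a known accuracy and otherwise picks uniformly among the remaining labels. The function name `label_posteriors`, the annotator identifiers, and the accuracy values are all hypothetical.

```python
from math import prod


def label_posteriors(votes, accuracy, labels, prior=None):
    """Posterior over the true label given one vote per annotator.

    Assumption (not from the paper): annotator k is correct with
    probability accuracy[k]; when wrong, k picks uniformly among the
    other labels; annotators err independently of each other.
    """
    n = len(labels)
    if prior is None:
        # Uniform prior over labels when none is supplied.
        prior = {c: 1.0 / n for c in labels}

    unnormalized = {}
    for c in labels:
        # Likelihood of the observed votes if c were the true label.
        likelihood = prod(
            accuracy[k] if v == c else (1.0 - accuracy[k]) / (n - 1)
            for k, v in votes.items()
        )
        unnormalized[c] = prior[c] * likelihood

    total = sum(unnormalized.values())
    return {c: p / total for c, p in unnormalized.items()}


# Worst case: three annotators, three different answers.
votes = {"ann1": "A", "ann2": "B", "ann3": "C"}
accuracy = {"ann1": 0.9, "ann2": 0.7, "ann3": 0.6}  # hypothetical values
posteriors = label_posteriors(votes, accuracy, ["A", "B", "C"])
best = max(posteriors, key=posteriors.get)
```

Under these assumed accuracies, the label from the most reliable annotator receives a posterior well above the one-in-three chance of a blind pick, which is the kind of gain the abstract describes for the all-disagree case.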

Knowledge UChicago

Preliminary investigation of the flow in an annular-diffuser-tailpipe combination with an abrupt area expansion and suction, injection, and vortex-generator flow controls

Authors: Henry John R; Wilbur Stafford W

NASA Technical Reports Server

Automatically Identifying Gene/Protein Terms in MEDLINE Abstracts

Authors: Hatzivassiloglou Vasileios; Rzhetsky Andrey; Wilbur W John; Yu Hong
Publication venue: Columbia University Libraries/Information Services
Publication date: 31/10/2002

Motivation. Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, the identification of terms corresponding to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein–protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for “gene/protein-full name mark up”), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the markup process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database. Therefore, our methods may also be used for automatic lexicon generation. Results. GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts. Availability: A random sample of gene/protein symbols and full names and a sample set of marked up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/

CiteSeerX

Elsevier - Publisher Connector

Columbia University Academic Commons

Summary of subsonic-diffuser data

Authors: Henry John R; Wilbur Stafford W; Wood Charles C

Subsonic-diffuser data - exit flow, boundary-layer control, and inlet velocit

NASA Technical Reports Server

SplicePort—An interactive splice-site analysis tool

Authors: Dogan Rezarta Islamaj; Getoor Lise; Mount Stephen M.; Wilbur W. John
Publication venue: Oxford University Press
Publication date: 01/01/2007

SplicePort is a web-based tool for splice-site analysis that allows the user to make splice-site predictions for submitted sequences. The user can also browse the rich catalog of features that underlies these predictions, which we have found capable of providing high classification accuracy on human splice sites. Feature selection is optimized for human splice sites, but the selected features are likely to be predictive for other mammals as well. With our interactive feature browsing and visualization tool, the user can view and explore subsets of features used in splice-site prediction (either the features that account for the classification of a specific input sequence or the complete collection of features). Selected feature sets can be searched, ranked or displayed easily. The user can group features into clusters, and frequency-plot WebLogos can be generated for each cluster. The user can browse the identified clusters and their contributing elements, looking for new interesting signals, or can validate previously observed signals. The SplicePort web server can be accessed at http://www.cs.umd.edu/projects/SplicePort and http://www.spliceport.org

CiteSeerX

PubMed Central

Stereospecific aliphatic hydroxylation upon photoreduction of iron (III)

Authors: Groves John T.; Swanson Wilbur W.
Publication venue: Elsevier BV
Publication date: 01/01/1975

Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/22203/1/0000634.pd

Crossref

Deep Blue Documents at the University of Michigan

GENETAG: a tagged corpus for gene/protein named entity recognition

Authors: Matten Wayne; Tanabe Lorraine; Thom Lynne H; Wilbur W John; Xie Natalie
Publication venue: BioMed Central
Publication date: 01/01/2005

Crossref

Springer - Publisher Connector

PubMed Central

Comprehensively identifying Long Covid articles with human-in-the-loop machine learning

Authors: Allot Alexis; Chen Qingyu; Islamaj Rezarta; Leaman Robert; Lu Zhiyong; Wilbur W. John
Publication date: 28/10/2022

A significant percentage of COVID-19 survivors experience ongoing multisystemic symptoms that often affect daily living, a condition known as Long Covid or post-acute sequelae of SARS-CoV-2 infection. However, identifying scientific articles relevant to Long Covid is challenging since there is no standardized or consensus terminology. We developed an iterative human-in-the-loop machine learning framework combining data programming with active learning into a robust ensemble model, demonstrating higher specificity and considerably higher sensitivity than other methods. Analysis of the Long Covid collection shows that (1) most Long Covid articles do not refer to Long Covid by any name, (2) when the condition is named, the name used most frequently in the literature is Long Covid, and (3) Long Covid is associated with disorders in a wide variety of body systems. The Long Covid collection is updated weekly and is searchable online at the LitCovid portal: https://www.ncbi.nlm.nih.gov/research/coronavirus/docsum?filters=e_condition.LongCovi

arXiv.org e-Print Archive